Results of OKKAM Feature based Entity Matching Algorithm for Instance Matching Contest of OAEI 2009
Abstract
To investigate the problem of entity recognition, we deal with the creation of the so-called Entity Name System (ENS), an open, public backbone infrastructure for the (Semantic) Web that enables the creation and systematic re-use of unique identifiers for entities. The ENS can be seen as a very large, distributed “phonebook for everything”, and ENS identifiers can be regarded as the “phone numbers” of entities. Entity descriptions are based on free-form key/value “tagging” rather than on a precise formalism. This genericity has a shortcoming, however: the ENS can never know what type of entity it is dealing with. We tackle this problem with a novel approach to entity matching called Feature Based Entity Matching (FBEM). In this paper, we report an evaluation of FBEM on the datasets provided by the OAEI committee for the instance matching contest.

1 Presentation of the system

With the growth and development of the Semantic Web, it has come to resemble a collection of “information islands” that are poorly integrated with each other. The problem of information integration in the Semantic Web is two-fold:

1. heterogeneity of vocabulary: the same concept can be referred to via different URIs, and may therefore be considered as different concepts in different vocabularies;
2. entity recognition: the same real-world object can be referred to via different URIs in different repositories, and may therefore not be recognized as the same object.

While the first issue is widely recognized and investigated [4], the second one has been largely neglected, although it has received a lot of attention under the headings of record linkage, data deduplication, entity resolution, etc. [1]. To investigate the problem of entity recognition, the EU-funded OKKAM project¹ deals with the creation of the so-called Entity Name System (ENS) [3].
¹ http://www.okkam.org

1.1 State, purpose, general statement

In this section, we introduce the ENS and describe our interest in the instance matching part of OAEI 2009.

The Entity Name System (ENS) [3] is an open, public backbone infrastructure for the (Semantic) Web that enables the creation and systematic re-use of unique identifiers for entities. It is implemented as a large-scale infrastructural component with a set of services for describing entities and assigning identifiers to them. Figuratively, the ENS can be seen as a very large, distributed “phonebook for everything”, and ENS identifiers can be regarded as the “phone numbers” of entities. This leads to more efficient information integration, and thus to a real global knowledge space, without the need for ex-post deduplication or entity consolidation.

In the ENS, we do not impose or enforce the use of any kind of schema or strong typing for the description of different types of entities. Instead, entity descriptions are free-form and based on key/value “tagging”. In this way, we support complete genericity, without the need for any formalism or abstract top-level categorization. Given these peculiarities of the ENS, our restriction to the instance matching part of OAEI 2009 becomes evident.

Obviously, such a generic model of entity description has its shortcomings: because there is no formal model, the ENS can never know what type of entity it is dealing with or how the entity is described. This becomes very relevant when searching for an entity, a process that we call entity matching. To address this problem, we rely on recent work [2] that investigated, in an experimental setting, how people actually describe (identify) entities. Based on these findings, we propose a novel approach to entity matching.
The approach takes into account not only the similarity of entity features (keys and values), but also the circumstance that certain features are more meaningful for identifying an entity than others. We call this approach Feature Based Entity Matching (FBEM) and explain it in the next section.

1.2 Specific techniques used

We consider both a reference (matching) entity $Q$ and a candidate (matched) entity $E$ as a set $F$ of features $f$: $F = \{f\}$, $f = \langle n, v \rangle$, where each feature $f$ is a pair of a name $n$ and a value $v$. We require neither names nor values to share a vocabulary, a schema, or even a natural language, i.e., they are independent in content and size. We enumerate the features of each entity with integers and denote by $f^Q_i$ and $f^E_j$ the $i$th and $j$th features of entities $Q$ and $E$, respectively. We define the following functions: $n(f_i)$ returns the name part of a feature of an entity; $v(f_i)$ returns the value part.

Now we define $f^{i,j}_{sim}(f^Q, f^E)$, a function that computes the similarity of two features $f^Q_i$ and $f^E_j$, as follows:

\[
f^{i,j}_{sim}(f^Q, f^E) \overset{\mathrm{def}}{=} sim(f^Q_i, f^E_j) \cdot
\begin{cases}
2 \lambda \mu, & \text{for } name(n(f^Q_i)),\ name(n(f^E_j)),\ id(f^Q_i, f^E_j); \\
2 \mu, & \text{for } name(n(f^Q_i)),\ name(n(f^E_j)); \\
\lambda \mu, & \text{for } name(n(f^E_j)),\ id(f^Q_i, f^E_j); \\
\mu, & \text{for } name(n(f^E_j)); \\
1, & \text{otherwise.}
\end{cases}
\tag{1}
\]

Equation 1 relies on the following functions and parameters:

$sim(x, y)$: a suitable string similarity measure between $x$ and $y$;
$name(x)$: a boolean function indicating whether the feature $x$ denotes one of the possible names of the entity;
$id(x, y)$: the identity function, returning true if the value parts of $x$ and $y$ are identical;
$\mu$: the factor by which a name feature is considered more important than a non-name feature;
$\lambda$: the extra factor attributed to the presence of the value identity $id(x, y)$.

In our implementation, we selected the Levenshtein metric as the similarity measure (the $sim$ function) and set both $\lambda$ and $\mu$ to 2.
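The feature similarity of Equation 1 can be illustrated with a minimal Python sketch. Several details are assumptions made for illustration rather than part of the text: `sim` is taken to be a Levenshtein distance normalized to [0, 1] and applied to the value parts, the cases of Equation 1 are evaluated top to bottom with the first match winning, and the names `Feature`, `NAME_VOCABULARY`, `is_name`, `weight` and `fsim` are hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """A feature f = <n, v>: a free-form name/value pair."""
    name: str
    value: str


# Vocabulary of feature names that denote a "name" of the entity (from the text).
NAME_VOCABULARY = {"name", "label", "title", "denomination", "moniker"}
LAMBDA = 2.0  # extra factor for identical values
MU = 2.0      # factor for name features


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def sim(x: str, y: str) -> float:
    """Assumption: edit distance normalized to a similarity in [0, 1]."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - levenshtein(x, y) / longest


def is_name(feature_name: str) -> bool:
    """name(n(f)): does the name part denote a name of the entity?"""
    return feature_name.strip().lower() in NAME_VOCABULARY


def weight(fq: Feature, fe: Feature) -> float:
    """Case factor of Equation 1; assumption: the first matching case applies."""
    q_name, e_name = is_name(fq.name), is_name(fe.name)
    identical = fq.value == fe.value  # id(fq, fe)
    if q_name and e_name and identical:
        return 2 * LAMBDA * MU
    if q_name and e_name:
        return 2 * MU
    if e_name and identical:
        return LAMBDA * MU
    if e_name:
        return MU
    return 1.0


def fsim(fq: Feature, fe: Feature) -> float:
    """Weighted similarity of one feature of Q and one feature of E."""
    return sim(fq.value, fe.value) * weight(fq, fe)
```

Under these assumptions, two name features with identical values, such as `Feature("name", "Trento")` and `Feature("label", "Trento")`, receive the largest factor, 2λμ = 8, while two non-name features are scored by `sim` alone.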
The latter can be interpreted as “the occurrence of a fact is twice as important as its absence”. We have also implemented a vocabulary, small enough to be kept in runtime memory, that is used to detect cases where an entity feature name actually denotes a “name” of the entity, e.g., “name”, “label”, “title”, “denomination”, “moniker”.

At this point, we are able to establish the similarity between individual features. To compute the complete feature-based entity similarity, which expresses to what extent $E$ is similar to $Q$, we proceed as follows. Let $maxv(V)$ be a function that computes the maximum value in a vector $V$. We then span the matrix $M$ of feature similarities between $Q$ and $E$, defined as

\[
M := \left( f_{sim}(Q, E) \right)_{|Q| \times |E|} \to \mathbb{Q}_{\geq 0}
\]

with $f_{sim}$ as defined above, and $|Q|$, $|E|$ being the number of elements of the vectors $Q$ and $E$, respectively. The feature-based entity similarity score $fs$ is defined as the sum of all the maximum similar feature combinations between $Q$ and $E$.
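As a rough illustration of the matrix construction, the following Python sketch builds M from a pluggable feature-similarity function and aggregates it into a score. The aggregation is an assumption: it reads “the sum of all the maximum similar feature combinations” as the sum of the row-wise maxima of M, i.e. the best match in E for each feature of Q. The names `similarity_matrix`, `entity_score` and `toy_fsim`, as well as the convention that an empty vector has maximum 0, are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Feature = Tuple[str, str]  # (name, value)
FeatureSim = Callable[[Feature, Feature], float]


def similarity_matrix(q: Sequence[Feature], e: Sequence[Feature],
                      fsim: FeatureSim) -> List[List[float]]:
    """M with |Q| rows and |E| columns; M[i][j] = fsim(q[i], e[j])."""
    return [[fsim(fq, fe) for fe in e] for fq in q]


def maxv(v: Sequence[float]) -> float:
    """Maximum value in a vector; assumption: an empty vector yields 0."""
    return max(v) if v else 0.0


def entity_score(q: Sequence[Feature], e: Sequence[Feature],
                 fsim: FeatureSim) -> float:
    """Assumed reading of fs: sum of the row-wise maxima of M."""
    return sum(maxv(row) for row in similarity_matrix(q, e, fsim))


def toy_fsim(fq: Feature, fe: Feature) -> float:
    """Stand-in similarity for the example: 1 if the values match, else 0."""
    return 1.0 if fq[1] == fe[1] else 0.0


q = [("name", "Trento"), ("country", "Italy")]
e = [("label", "Trento"), ("population", "117000")]
matrix = similarity_matrix(q, e, toy_fsim)
score = entity_score(q, e, toy_fsim)
```

Passing the similarity function as a parameter keeps the sketch independent of how Equation 1 is implemented; any function with the same signature can be plugged in.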
Publication date: 2009